Clocks


Boot clock < maximum clock. 


dmesg:
nouveau [CLK] [0000:01:00.0] core 135 MHz shader 270 MHz memory 135 MHz
nouveau [CLK] [0000:01:00.0] core 405 MHz shader 810 MHz memory 405 MHz
nouveau [CLK] [0000:01:00.0] core 606 MHz shader 1468 MHz memory 790 MHz
nouveau [CLK] [0000:01:00.0] core 405 MHz shader 810 MHz memory 405 MHz
Challenge


Challenge - draw me a red triangle 


10110101 
10101001 
00110... 


Challenge - change my clocks


10110101 
10101001 
00110... 
Process


Step 1: Understand the clock tree 


Process - clock tree 


Prior work: cycle counters

> Measure speed of clocks
> Observe result of clock register changes
> Map register to clock
> Map clock to VBIOS entry

Read-out tool 'nvatiming' in Envytools: http://github.com/envytools/envytools
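The "measure speed of clocks" step can be illustrated with a minimal sketch. It assumes a free-running cycle counter sampled twice with a known wall-clock interval in between; the function name, the 32-bit counter width and the sample numbers are hypothetical and are not taken from nvatiming.

```python
def clock_mhz(count_start, count_end, elapsed_s, counter_bits=32):
    """Estimate a clock frequency in MHz from two cycle-counter samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # Modular subtraction tolerates a single counter wrap between the samples.
    cycles = (count_end - count_start) % (1 << counter_bits)
    return cycles / elapsed_s / 1e6


# Made-up samples: 67.5 million cycles counted over half a second.
print(clock_mhz(0, 67_500_000, 0.5))
```

Repeating the measurement before and after a register write shows which clock that register controls.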


Process - clock tree 


VBIOS contains lots of parameters 


000cfb0 c35c e6ff 40c3 1012 0903 0004 ff00 0001
000cfc0 0007 0100 1e00 0305 5040 0000 0000 0000
000cfd0 9300 ff19 0000 8700 0000 010e 8700 0000
000cfe0 001b cb00 0000 0115 8700 0080 0000 0000
000cff0 0000 4007 0055 0000 0000 0000 0896 0055
000d000 0064 0195 2a00 0043 0195 9500 0001 0195
000d010 1500 0001 80ca 0000 0000 0000 ff00 0000
000d020 0000 0000 0000 0000 0000 0000 0000 0000
*
000d040 0000 0000 0000 0000 000f 0064 0000 0000
000d050 0000 0a96 0055 0000 025e bc00 0005 0316
000d060 1c00 0002 0195 1500 0001 810e 9500 0001
000d070 432a 1000 0508 080a 0004 0000 c700 0e00
000d080 ff00 0000 0000 0000 0000 ff00 0000 0000
000d090 0000 0000 0200 107f 9000 109f f310 0200
000d0a0 107f 9000 109f f210 ff00 0000 0000 0000
000d0b0 0000 ff00 0000 0000 0000 0000 ff00 0000


Process - clock tree 


Envytools:

PM_Mode table at 0xcfb5. Version 4.0. RamCFG 0x2. Info_length 16.
Subentry length 3. Subentry count 9. Subentry Offset 16
Boot perflvl: 0x7
Header:
0xcfb5: 40 12 10 03 09 04 00 00 ff 01 00 07 00 00 01 00 1e 05
4 performance levels

-- ID 0x3 Voltage entry 80 PCIe link 2.5 GT/s--
0xcfc7: 03 40 50 00 00 00 00 00 00 00 93 19 ff 00 00 00
0:0xcfd7: 87 00 00 : core freq = 135 MHz
1:0xcfda: 0e 01 00 : shader freq = 270 MHz
2:0xcfdd: 87 00 00 : memclk freq = 135 MHz
3:0xcfe0: 1b 00 00 : vdec freq = 27 MHz
4:0xcfe3: cb 00 00 : unka0 freq = 203 MHz
5:0xcfe6: 15 01 00 : host freq = 277 MHz
6:0xcfe9: 87 80 00 : core intm freq = 135 MHz (force no PLL)
7:0xcfec: 00 00 00 : unk_engine freq = 0 MHz
8:0xcfef: 00 00 00 : unk_engine freq = 0 MHz
[...]
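The 3-byte sub-entries above (for example 0e 01 00 for the 270 MHz shader clock) can be decoded by hand. The sketch below is an illustration, not the Envytools implementation. It assumes the first two bytes form a little-endian 16-bit word whose low 15 bits hold the frequency in MHz, and that the top bit is a flag; that layout is inferred only from the bytes shown on this slide, and the function name is hypothetical.

```python
def decode_subentry(raw):
    """Decode one 3-byte PM_Mode sub-entry into (freq_mhz, flag)."""
    if len(raw) != 3:
        raise ValueError("sub-entry must be exactly 3 bytes")
    word = raw[0] | (raw[1] << 8)  # little-endian 16-bit value
    # Assumption: bit 15 is a flag, the remaining 15 bits are MHz.
    return word & 0x7FFF, bool(word & 0x8000)


# Sub-entry bytes as they appear in the listing.
SUBENTRIES = {
    "core": "87 00 00",
    "shader": "0e 01 00",
    "vdec": "1b 00 00",
    "host": "15 01 00",
    "core intm": "87 80 00",
}

for name, hexbytes in SUBENTRIES.items():
    freq, flag = decode_subentry(bytes.fromhex(hexbytes))
    print(f"{name}: {freq} MHz, flag={flag}")
```

Under these assumptions the decoded frequencies agree with the values in the listing, and only the "core intm" entry has the flag bit set.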


Process 


Step 2: Generate parameters 


Process - parameters 


Generate

> RAM timings
> Other reliability features - ODT, DLL
> Link-training
> ... And a whole load of unknown bits

Strategy: mimic the blob


Process - parameters 


MMIOTrace: Intercept communication between blob and device. 


0x001538 0x00011111 PBUS+0x538 => 0x11111
0x001538 0x00011111 PBUS+0x538 <= 0x11111
0x0041a4 0x00063131 PCLOCK.VDCLK => { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x0041a4 0x00063131 PCLOCK.VDCLK => { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x0041a4 0x00043131 PCLOCK.VDCLK <= { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x0041a4 0x00043131 PCLOCK.VDCLK <= { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x00c040 0x20001000 PCONTROL.MASTER => { UNK12 | HOST = 277MHz }
0x00c040 0x20001000 PCONTROL.MASTER <= { UNK12 | HOST = 277MHz }
0x004160 0x00063131 PCLOCK.NVCLK => { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x004120 0x00063031 PCLOCK.NVPLL_SRC => { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x004164 0x00023030 PCLOCK.SCLK => { VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | OUTPUT = VCO | ...
0x00e8a4 0x05063c01 PNVIO.RPLL2.COEF => { M = 0x1 | N = 0x3c | UNK16 = 0x6 | UNK28 = 0x5 }
0x004124 0x00063131 PCLOCK.SPLL_SRC => { VCO_ENABLE | VCO_SRC = RPLL2 | OUTPUT_1 = 100000 | ...
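As a small illustration of how a traced register value maps onto named fields, the sketch below pulls M and N out of the PNVIO.RPLL2.COEF value 0x05063c01 seen in the trace. The bit positions (M in bits 0-7, N in bits 8-15) are an assumption chosen to be consistent with the decoded M = 0x1 and N = 0x3c; the UNK fields are left alone because the trace does not show their widths. The function name is hypothetical.

```python
def decode_pll_coef(value):
    """Extract the M and N fields from a 32-bit PLL coefficient word."""
    return {
        "M": value & 0xFF,         # assumed: bits 0-7
        "N": (value >> 8) & 0xFF,  # assumed: bits 8-15
    }


fields = decode_pll_coef(0x05063C01)
print({name: hex(v) for name, v in fields.items()})
```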


Process - parameters 


SEQ 


> Script language used for changing memory clock
> Complete ISA
  > Read/write registers
  > Read/write memory
  > Sleep
  > Arithmetic
  > (Conditional) Branching
> Runs on the GPU, bus can be cut off
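To make the idea of a small script ISA concrete, here is a toy interpreter with register reads and writes, a sleep, one arithmetic operation and a conditional branch. It is purely illustrative: the opcode names, the single accumulator and the register addresses are all invented for this sketch and do not reflect the real SEQ encoding, and memory access is not modelled.

```python
def run_script(script, regs, max_steps=1000):
    """Run a toy register script; return the sleeps (in ns) it requested."""
    acc = 0      # single accumulator used by read/and/branch_ne
    pc = 0       # index of the next instruction
    sleeps = []
    for _ in range(max_steps):
        if pc >= len(script):
            return sleeps
        op, *args = script[pc]
        pc += 1
        if op == "read":           # acc = regs[addr]
            acc = regs.get(args[0], 0)
        elif op == "write":        # regs[addr] = immediate
            regs[args[0]] = args[1]
        elif op == "and":          # acc &= mask
            acc &= args[0]
        elif op == "sleep":        # recorded instead of actually sleeping
            sleeps.append(args[0])
        elif op == "branch_ne":    # jump to target if acc != value
            if acc != args[0]:
                pc = args[1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise RuntimeError("script did not terminate")


# All register addresses below are made up for the example.
script = [
    ("write", 0x4000, 0x11),   # program a clock register
    ("sleep", 2000),           # give it time to settle
    ("read", 0x4004),          # fetch a status register
    ("and", 0x1),              # isolate bit 0
    ("branch_ne", 0x1, 6),     # skip the next write unless bit 0 is set
    ("write", 0x4008, 0x1),
]

regs = {0x4004: 0x1}
print(run_script(script, regs), regs)
```

Because such a script needs no help from the host once it is loaded, it can keep running while the bus is cut off.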


Process 


Step 3: Execute 




Process - Execute 


> Reclock memory
  > "Prepare"
  > Wait for VBLANK
  > Pause engines
  > Set clock
  > Set params
  > Reset clock-dependent subcomponents (DLL)
  > Resume engines
> Disable interrupts
> Pause engines
> Change other clocks + voltages (GPIO)
> Resume engines
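The ordering above can be sketched as code. This is a minimal illustration that only records step names in a list; it touches no hardware, every function name is hypothetical, and the step order simply follows the slide. A context manager is used so that engines are resumed even if a step in between fails.

```python
from contextlib import contextmanager


@contextmanager
def engines_paused(log):
    """Pause engines for the duration of the block, then always resume."""
    log.append("pause engines")
    try:
        yield
    finally:
        log.append("resume engines")


def reclock(log, change_memory=True):
    """Append the reclocking steps to log in the order given on the slide."""
    if change_memory:
        log.append("prepare")
        log.append("wait for VBLANK")
        with engines_paused(log):
            log.append("set memory clock")
            log.append("set memory params")
            log.append("reset DLL")
    # The slide does not say where interrupts are re-enabled.
    log.append("disable interrupts")
    with engines_paused(log):
        log.append("change other clocks + voltages (GPIO)")
    return log


for step in reclock([]):
    print(step)
```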


Process 


Step 4: Generalise 


Results


Results - Test system 


> AMD FX-6300, 6-core @ 3,5GHz
> 8GB DDR3 @ 1,83GHz
> NVIDIA GPU
  > GT310, DDR3
  > GT240, GDDR3
> Fedora 20 x86_64
> Mesa 10.1.5
> xorg-x11-drv-nouveau 1.0.9-2




Results - Portal 


Portal
Resolution: 1280 x 1024
Frames Per Second, More Is Better (Phoronix Test Suite 5.2.1, PHORONIX-TEST-SUITE.COM)

G310 405MHz/405MHz: value not legible, SE +/- 0.02
G310 589MHz/790MHz: 34.29, SE +/- 0.03
Two further bars (labels not legible): SE +/- 0.06 and SE +/- 0.20; the only legible value is 101.57


Results - Xonotic 


Xonotic v0.7
Resolution: 1280 x 1024 - Effects Quality: Ultimate
Frames Per Second, More Is Better (Phoronix Test Suite 5.2.1, PHORONIX-TEST-SUITE.COM)

G310 405MHz/405MHz: value not legible, SE +/- 0.04
G310 589MHz/790MHz: 12.35, SE +/- 0.01
GT240 405MHz/324MHz: value not legible, SE +/- 0.55
GT240 575MHz/1000MHz: value not legible, SE +/- 1.62
Future work


Future work 


> Reliability
  > GPU doesn't pause nicely, changing clocks hangs under stress
  > Some DDR2 cards show corruption in highest perf lvl, bandwidth problem?
  > Link training should be done on boot, now postponed
> GDDR5 reclocking
> Load-based automatic clock/voltage adjustment
> Fermi, Kepler, Maxwell




Conclusion 


> Clock scaling mandatory for speed
> Understandable process
  1. Understand the clock tree
  2. Generate parameters
  3. Execute
  4. Generalise
> Very rewarding in terms of FPS!


Code drops soon in a kernel near you... Or try 


https://github.com/RSpliet/kernel-nouveau-nva3-pm


Oh, one more thing... 


Hire me!


See http://roy.spliet.org 


Questions 


Questions? 


